residual stream
Transformers Represent Belief State Geometry in their Residual Stream
What computational structure are we building into large language models when we train them on next-token prediction? Here, we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. Leveraging the theory of optimal prediction, we anticipate and then find that belief states are linearly represented in the residual stream of transformers, even in cases where the predicted belief state geometry has highly nontrivial fractal structure. We investigate cases where the belief state geometry is represented in the final residual stream or distributed across the residual streams of multiple layers, and provide a framework to explain these observations. Furthermore, we demonstrate that the inferred belief states contain information about the entire future, beyond the local next-token prediction that the transformers are explicitly trained on. Our work provides a general framework connecting the structure of training data to the geometric structure of activations inside transformers.
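As a companion to the abstract above, a minimal sketch of the belief-state analysis: run Bayesian filtering over the hidden states of a toy process to get the belief trajectory, then check that a linear probe recovers those beliefs. Everything here is a stand-in; the three-state process is random rather than one of the paper's fractal examples, and `resid` is a synthetic proxy for activations read out of a trained transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# T[x][i, j] = P(emit token x, move to hidden state j | state i);
# a random 3-state, 2-token process (hypothetical, not from the paper).
T = rng.dirichlet(np.ones(6), size=3).reshape(3, 2, 3).transpose(1, 0, 2)

def update_belief(b, x):
    # One Bayesian filter step: b'(j) proportional to sum_i b(i) * T[x][i, j].
    b = b @ T[x]
    return b / b.sum()

# Roll out the process, tracking the optimal predictor's belief state.
belief, state = np.ones(3) / 3, 0
beliefs = []
for _ in range(2000):
    p = np.stack([T[0][state], T[1][state]])  # joint over (token, next state)
    k = rng.choice(6, p=p.ravel())
    x, state = divmod(k, 3)
    belief = update_belief(belief, x)
    beliefs.append(belief)
B = np.stack(beliefs)

# Synthetic "residual stream": a fixed linear image of the beliefs plus
# noise. In the paper this is replaced by activations collected from a
# transformer trained on sequences drawn from the process.
W = rng.normal(size=(3, 64))
resid = B @ W + 0.05 * rng.normal(size=(len(B), 64))

# Linear probe: regress belief states on activations and measure the fit.
W_probe, *_ = np.linalg.lstsq(resid, B, rcond=None)
r2 = 1 - ((resid @ W_probe - B) ** 2).sum() / ((B - B.mean(0)) ** 2).sum()
print(f"linear recovery of belief states: R^2 = {r2:.3f}")
```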
The Impossibility of Inverse Permutation Learning in Transformer Models
Alur, Rohan, Hays, Chris, Raghavan, Manish, Shah, Devavrat
In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original ("canonical") string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple-choice QA and in-context learning. Our primary contribution is an impossibility result: we show that a decoder-only transformer of arbitrary depth cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The second is more surprising: we show that simply padding the input with "scratch tokens" yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate "thinking" tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
- North America > Mexico > Gulf of Mexico (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
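A quick sketch of the inverse permutation task from the Alur et al. entry above, including the scratch-token padding from their second construction. The input layout here (permutation indices, a separator, the permuted string, then padding) is an illustrative assumption, not the paper's exact encoding:

```python
import random

SCRATCH = "<pad>"   # hypothetical scratch-token symbol; the construction
                    # only needs such tokens to exist, their form is free.

def make_example(canonical, n_scratch=0, rng=random):
    """One inverse-permutation instance: given the permutation and the
    permuted string, the target is the canonical string."""
    n = len(canonical)
    perm = list(range(n))
    rng.shuffle(perm)
    permuted = [canonical[perm[i]] for i in range(n)]
    # Assumed prompt layout: permutation indices, a separator, the
    # permuted string, then optional scratch padding.
    prompt = [str(i) for i in perm] + ["|"] + permuted + [SCRATCH] * n_scratch
    return prompt, list(canonical)

def invert(perm):
    # Since permuted[i] = canonical[perm[i]], the inverse permutation
    # recovers canonical[j] = permuted[inv[j]].
    inv = [0] * len(perm)
    for i, j in enumerate(perm):
        inv[j] = i
    return inv

prompt, target = make_example(list("abcde"), n_scratch=5)
print(prompt, "->", target)
```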
Circuits, Features, and Heuristics in Molecular Transformers
Varadi, Kristof, Marosi, Mark, Antal, Peter
Transformers generate valid and diverse chemical structures, but little is known about the mechanisms that enable these models to capture the rules of molecular representation. We present a mechanistic analysis of autoregressive transformers trained on drug-like small molecules, revealing the computational structure underlying their capabilities across multiple levels of abstraction. We identify computational patterns consistent with low-level syntactic parsing as well as more abstract chemical validity constraints. Using sparse autoencoders (SAEs), we extract feature dictionaries associated with chemically relevant activation patterns. We validate our findings on downstream tasks and find that these mechanistic insights translate into predictive performance gains in a variety of practical settings.
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Materials > Chemicals > Commodity Chemicals (0.46)
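As a companion to the Varadi et al. entry above, a minimal sketch of the SAE dictionary-learning step it describes. The single-layer encoder/decoder with a ReLU and an L1 sparsity penalty is the common recipe; the dimensions, the L1 coefficient, and the random stand-in activations are illustrative guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """A minimal SAE for dictionary learning on model activations."""
    def __init__(self, d_model=256, d_dict=2048):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # nonnegative feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # illustrative; sparsity strength is a key hyperparameter

# Stand-in batch: in practice `acts` would be residual-stream activations
# collected from the molecular transformer on SMILES inputs.
acts = torch.randn(4096, 256)

for step in range(200):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of the decoder weight is a dictionary element ("feature")
# whose activations can be inspected for chemically meaningful patterns.
print(sae.dec.weight.T.shape)  # (d_dict, d_model) dictionary
```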
Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis
Kapl, Ferdinand, Angelis, Emmanouil, Höppe, Tobias, Maile, Kaitlin, von Oswald, Johannes, Scherrer, Nino, Bauer, Stefan
Gradually growing the depth of Transformers during training can not only reduce training cost but also improve reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, a phenomenon known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements on downstream reasoning benchmarks. Overall, this work highlights how gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
- Europe > Austria > Vienna (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (5 more...)
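To make the growth operator in the Kapl et al. entry above concrete, here is a schematic of middle stacking: duplicate a block of layers from the middle of the current stack so the deeper model starts near the function the shallower one computed. The block choice and schedule are simplifications for illustration, not the exact MIDAS recipe or the paper's lightweight modification.

```python
import copy
import torch.nn as nn

def grow_middle(layers: nn.ModuleList, n_copy: int) -> nn.ModuleList:
    """One growth step of middle stacking: copy n_copy layers from the
    middle of the stack and insert the copies next to their sources."""
    lo = len(layers) // 2 - n_copy // 2
    hi = lo + n_copy
    copies = [copy.deepcopy(l) for l in list(layers)[lo:hi]]
    return nn.ModuleList(list(layers)[:hi] + copies + list(layers)[hi:])

# Toy growth schedule: 4 -> 6 -> 8 layers. In MIDAS each stage is trained
# before the next growth step; training is omitted here.
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(4)
)
for stage in range(2):
    # ... train the current stack here ...
    layers = grow_middle(layers, n_copy=2)
    print(f"after growth step {stage}: depth = {len(layers)}")
```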